05. Data Cleaning Process

The Process

Before any cleaning occurs, make a copy of each piece of data. All of the cleaning operations are then conducted on this copy, so you can still view the original dirty and/or messy dataset later. DataFrames in pandas are copied using the copy method. If the original DataFrame is called df, the soon-to-be-clean copy of the dataset could be named df_clean.

df_clean = df.copy()

Note that simply assigning a DataFrame to a new variable name does not create a copy: both names refer to the same object, which leaves the original DataFrame vulnerable to modifications. This is explained in the answers to the Stack Overflow question "Why should I make a copy of a DataFrame in pandas?"
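The difference between assignment and copy can be seen in a minimal sketch. The toy DataFrame and the name alias below are hypothetical, made up purely for illustration; only df and df_clean come from the text above.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({"zip_code": [92390.0, 2718.0]})

alias = df            # plain assignment: a second name for the same object
df_clean = df.copy()  # an independent copy of the data

alias.loc[0, "zip_code"] = 0.0      # this also changes df
df_clean.loc[1, "zip_code"] = -1.0  # this leaves df untouched

print(alias is df)     # True: same object
print(df_clean is df)  # False: separate object
print(df["zip_code"].tolist())
```

After running this, the first row of df has changed through alias, while the edit made to df_clean is not visible in df.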

An Example

Note: a copy of the original dataset was not made before cleaning in the following example, though one should have been.

Quiz

Using the snapshot of the patients table and the output of patients.info() below, answer the following matching quiz.

Snapshot of the *patients* table

Output of `patients.info()`

QUIZ QUESTION:

Match each statement below to the appropriate step of the data cleaning process for the zip code issues in the patients_clean table (a copy of the patients table).

ANSWER CHOICES:



Data Cleaning Step

Statement

patients_clean.zip_code.head()

Convert the zip code column's data type from a float to a string using astype, remove the '.0' using string slicing, and pad four-digit zip codes with a leading 0

Zip code has four digits sometimes

Zip code is a float not a string

patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')

SOLUTION:

Data Cleaning Step

Statement

patients_clean.zip_code.head()

Convert the zip code column's data type from a float to a string using astype, remove the '.0' using string slicing, and pad four-digit zip codes with a leading 0

patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0')
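To see what the cleaning statement above does, here is a small sketch on made-up values. The three zip codes are hypothetical stand-ins, not rows from the actual patients table, and the sketch assumes the column has no missing values.

```python
import pandas as pd

# Hypothetical stand-in for patients_clean; real data would come from a copy
# of the patients table
patients_clean = pd.DataFrame({"zip_code": [92390.0, 2718.0, 7095.0]})

# float -> string ('2718.0'), slice off the trailing '.0' ('2718'),
# then left-pad to five characters with '0' ('02718')
patients_clean["zip_code"] = (
    patients_clean["zip_code"].astype(str).str[:-2].str.pad(5, fillchar="0")
)

# Check the result
print(patients_clean["zip_code"].tolist())  # ['92390', '02718', '07095']
```

Note that str.pad pads on the left by default, which is what restores the leading 0 that was lost when the zip codes were stored as floats.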